Concept-Aware Deep Knowledge Tracing and Exercise 
Recommendation in an Online Learning System 


Fangzhe Ai 
School of Electronics and 
Information Engineering 
Beijing Jiaotong University 
17125001@bjtu.edu.cn


Yongxiang Zhao 
School of Electronics and 
Information Engineering 
Beijing Jiaotong University 
yxzhao@bjtu.edu.cn 


ABSTRACT 


Personalized education systems recommend learning content to students
based on their capacity, in order to accelerate their learning. This
paper proposes a personalized exercise recommendation system for online
self-directed learning. We first improve the performance of knowledge
tracing models. Existing deep knowledge tracing models, such as the
Dynamic Key-Value Memory Network (DKVMN), ignore exercises’ concept
tags, which are usually available in tutoring systems. We modify DKVMN
to design its memory structure based on the course’s concept list, and
explicitly consider the exercise-concept mapping relationship during
students’ knowledge tracing. We evaluated the model on a 5th grade math
exercise dataset from TAL, one of the biggest education groups in China,
and found that our model outperforms existing models. We also enhance
the DKVMN model to support more input features and obtain higher
performance. Second, we use the model to build a student simulator, and
use it to train an exercise recommendation policy with deep
reinforcement learning. Experimental results show that our policy
achieves better performance than an existing heuristic policy in terms
of maximizing the students’ knowledge level. To the best of our
knowledge, this is the first time that deep reinforcement learning has
been applied to personalized mathematical exercise recommendation.


1. INTRODUCTION 


Online self-directed learning systems, such as Massive Open Online
Courses (MOOCs), are prevalent. These systems, however, assign the same
exercises to all students, which is inefficient. In contrast,
personalized exercise recommendation can improve the efficiency of
students’ learning. In this paper, we propose a personalized exercise
recommendation system for an online self-directed learning service. The
system consists of two parts:


• A student knowledge tracing model, which traces a student’s knowledge
  state and predicts whether or not she can finish the exercise
  correctly.


• A personalized exercise recommendation policy, which recommends
  appropriate exercises to a student to accelerate her learning process.


Existing deep knowledge tracing models [9, 13] ignore exercises’
knowledge concept properties, which are usually available in tutoring
systems. In contrast, in this paper, we propose a concept-aware deep
knowledge tracing model. The model is inspired by the Dynamic Key-Value
Memory Network (DKVMN) model [13]. The DKVMN model has a static matrix
called key, which stores the latent knowledge concepts, and a dynamic
matrix called value, which stores a student’s concept mastery levels.
The model computes the correlation between an exercise and the latent
concepts in the key, then uses it to read the student’s concept mastery
levels in the value and to predict whether the student will finish the
exercise correctly. We improve the DKVMN model as follows: 1) we design
its memory structure based on the course’s concept list and explicitly
consider the exercise-concept mapping relationship during students’
knowledge tracing; 2) we enhance it to support more input features,
including exercise difficulty, stages, and student practice time. We
evaluated the model on a 5th grade math exercise dataset from TAL, and
found that our model outperforms existing deep knowledge tracing models.


In terms of personalized exercise recommendation policy, most existing
algorithms are heuristic, e.g., exercises which are too easy or too hard
for a student should be avoided. These algorithms may not be optimal, as
they only consider the short-term reward. In this paper, we build a
student simulator with our concept-aware deep knowledge tracing model,
and then use it to train a flexible and scalable personalized exercise
recommendation policy with deep reinforcement learning, which considers
the long-term reward of recommended items.


In summary, the main contributions of this paper are twofold:


• We propose a new exercise-level deep knowledge tracing model whose
  structure is built based on the course’s concept list, and the
  exercise-concept mapping relationships are utilized during students’
  knowledge tracing. The model supports more input features and obtains
  higher performance compared with existing models.


• We propose an exercise recommendation algorithm which uses model-free
  reinforcement learning with neural network function approximation to
  learn an exercise recommendation policy. The policy directly operates
  on raw observations of a student’s exercise history. Experimental
  results show that our policy achieves better performance than an
  existing heuristic policy in terms of maximizing students’ knowledge
  level.


2. RELATED WORK 


Knowledge Tracing: The work in [3] proposed a Bayesian knowledge tracing
model. It models a student’s status of a knowledge concept as a binary
variable, and updates the probability of her mastering the concept
according to her exercise results through a Hidden Markov Model. This
model works at the concept level, and ignores the relationship between
different concepts. The work in [9] proposed a deep knowledge tracing
(DKT) model with a recurrent neural network. It models a student’s
knowledge states as latent variables, and gets better performance than
the Bayesian model does [6]. The work in [12] proposed to improve DKT by
considering exercises’ semantic features. The work in [13] proposed
DKVMN, which tries to model the correlation between different latent
concepts. Inspired by DKVMN, this paper proposes a model whose structure
is explicitly built based on the course’s concept list, and the
exercise-concept mapping relationship is utilized in the model.


Exercise Recommendation: The work in [1] proposed that an exercise is
recommended to a student if the probability of her doing the exercise
correctly is around 50%. The problem of this algorithm is that the 50%
threshold is heuristic and may not be optimal. The work in [2] allows
experts to specify a ZPD (Zone of Proximal Development) based on the
current knowledge state of a student, and then chooses the most
profitable exercise with a multi-armed bandit algorithm. The algorithm
can discover the characteristics of students through exploration, but it
is inefficient because every student needs an independent exploration
process. The work in [5] leverages a DKT model for recommendation, and
frames the problem space using a ZPD explicitly facilitated by the DKT
model. The work in [7] first estimates each student’s knowledge profile
from their previous exercise results using the SPARFA framework. Then,
it uses these knowledge profiles as contexts and applies a contextual
bandit algorithm to recommend exercises, maximizing a student’s
immediate success, i.e., her performance on the next exercise. The
problem of this algorithm is that it only considers the next step, and
thus its performance may not be optimal. The work in [10] evaluated a
review scheduling algorithm for spaced repetition systems based on deep
reinforcement learning. We are inspired by this work and evaluate the
performance of deep reinforcement learning in our math self-directed
learning system.


3. BACKGROUND 


In this section, we introduce our online learning system and 
dataset. 


3.1 Intelligent Practice System (IPS) 

IPS is an online self-directed learning system developed by TAL
Education Group, Inc. of China. In IPS, each course (e.g., the 5th grade
math) has tens of units. Each unit includes 7 stages, i.e., 1) warming-up
exercises before class, 2) in-class exercises before lecture, 3) video
lecture, 4) in-class exercises after lecture, 5) homework exercises,
6) unit review exercises, 7) multi-unit review exercises. Among these 7
stages, stages 1 to 5 include content of a single knowledge concept,
whereas stages 6 and 7 also include exercises of other knowledge
concepts for review. As IPS is a self-directed learning system, a
student can choose any teaching unit to study. In a unit, she can also
exit the current stage or the whole unit at any time. The system records
the student’s learning duration in each stage, the exercises she
practices, and her results, i.e., whether or not the answer is correct.


In IPS, each exercise has three knowledge concept tags, which are
provided by experts. The knowledge concept tags have a hierarchical tree
structure. For instance, for one exercise, its 1st, 2nd, and 3rd level
concept tags are "Number Theory", "Prime Number and Composite Number",
and "Decomposition of Prime Factor", respectively.


3.2 Data Set and Data Pre-Processing 

We use a sample of anonymized student usage interactions from the 5th
grade math curriculum in IPS. We choose exercising records whose
first-level knowledge concept is "Number Theory", which has 7
second-level knowledge concepts and 15 third-level knowledge concepts.
We further choose students whose exercise records include at least 5
exercises. The resulting dataset includes 44,128 exercise records of
7,124 students.
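The student filter described above can be sketched as follows. This is a minimal illustration only: the record layout `(student_id, exercise_id, correct)` and the function name `filter_students` are assumptions made for the example, not the IPS schema.

```python
from collections import Counter

def filter_students(records, min_records=5):
    """Keep only the records of students who have at least `min_records` records."""
    counts = Counter(student_id for student_id, _, _ in records)
    return [r for r in records if counts[r[0]] >= min_records]

# Toy data (hypothetical): student "a" has 5 records, student "b" has 2.
records = [("a", i, 1) for i in range(5)] + [("b", 0, 0), ("b", 1, 1)]
kept = filter_students(records)
```

Only the five records of student "a" survive the filter in this toy case.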


4. KNOWLEDGE TRACING MODEL 


We now introduce our knowledge tracing model based on 
DKVMN model, and highlight our improvement in aspects 
of memory structure, knowledge concept weight, and read 
and update process. 


4.1 Concept-Aware Memory Structure 

Figure 1: Concept-aware DKVMN model structure.

We modify DKVMN to design its memory structure based on the course’s
concept list. Fig. 1 plots the model’s structure, which is based on the
DKVMN model [13]. As shown in Fig. 1, $M^k$ is the concept embedding
matrix whose size is $M \times N$, where $N$ is the number of memory
locations, and $M$ is the vector size at each location. We set $N$ equal
to the number of the course’s knowledge concepts. As we have 1
first-level knowledge concept, 7 second-level knowledge concepts and 15
third-level knowledge concepts, we have $N = 23$. Then, in each
location, the student’s state for the corresponding knowledge concept is
saved. Thus, the model’s memory architecture is explicitly designed to
represent knowledge concepts. For comparison, $N$ is a model parameter
in DKVMN representing the number of latent knowledge concepts, e.g.,
$N = 5$. As our model is inspired by DKVMN, we name it Concept-Aware
DKVMN, i.e., DKVMN-CA.
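One way to picture the memory layout is a lookup from each concept tag to its memory location. The sketch below is only an illustration: the `(level, index)` keys, the ordering of locations, and the name `build_concept_index` are assumptions, since the text fixes only the total number of locations.

```python
def build_concept_index(level_sizes):
    """Assign one memory location to every concept at every level of the tag tree."""
    index = {}
    for level, size in enumerate(level_sizes, start=1):
        for j in range(size):
            index[(level, j)] = len(index)
    return index

# 1 first-level, 7 second-level and 15 third-level concepts.
concept_index = build_concept_index([1, 7, 15])
N = len(concept_index)
```

With such an index, the three tags of an exercise map to three of the N memory locations.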


4.2 Knowledge Concept Weight 


As a student’s state of a knowledge concept is saved in the 
corresponding memory location, when a new exercise ar- 
rives, only the exercise’s related concepts’ memory locations 
are retrieved and updated. We now present the details of 
such a procedure. In this section, we calculate the knowl- 
edge concept weight (KCW) of the exercise. The weights 
will be used to calculate the weighted sum of a user’s cur- 
rent knowledge concept states to predict her performance 
on the exercise. It will also be used to update the student’s 
knowledge state after obtaining the answer result of the stu- 
dent on the exercise. 


We first obtain the embedding of the arrived exercise. As shown in
Fig. 1, when an exercise $q_t$ arrives at time $t$, it is first
transformed into an embedding vector $m_t$ through an exercise embedding
matrix $A$. We then calculate the KCW through Algorithm 1. As shown in
Algorithm 1, at line 1, we initialize the weight list $R$. As each
exercise has three knowledge concepts, the length of $R$ is 3. Then, for
each knowledge concept $n$ (line 2), we calculate the dot product of the
embedding of the exercise (i.e., $m_t$) and the concept embedding
(line 3). We then calculate the KCW by obtaining the softmax of $R$,
with $\mathrm{Softmax}(z_i) = e^{z_i} / \sum_j e^{z_j}$ (line 6). Then,
we initialize an all-zero vector Weight whose length is the number of
the knowledge concepts $N$ (line 7). For each knowledge concept of the
exercise (lines 8 to 12), we set its weight value in Weight.


In summary, DKVMN computes the relationship weights be- 
tween the exercise and all latent knowledge concepts, but we 
just compute the relationship weights between the exercise 
and its knowledge concepts. For the exercise’s relationship 
weights with other concepts, we set them zeros. 


Algorithm 1 Knowledge Concept Weight Calculation

Input:
    m_t: embedding of the exercise q_t that arrived at time t
    K_t: knowledge concept list of q_t
    M^k: the concept embedding matrix
Output:
    Weight: knowledge concept weight of the exercise that arrived at time t

     /* Calculate KCW */
 1:  R ← []
 2:  for each n ∈ K_t do
 3:      corr = m_t^T · M^k[n]
 4:      R.append(corr)
 5:  end for
 6:  R_s = Softmax(R)
     /* Reshape the weight vector to make its length equal to the
        number of concepts */
 7:  Weight ← [0, ..., 0]
 8:  i ← 0
 9:  for i < 3 do
10:      Weight[K_t[i]] = R_s[i]
11:      i ← i + 1
12:  end for
13:  return Weight
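The KCW calculation can be sketched in plain Python as below. This is an illustrative sketch, not the authors' implementation: the concept embedding matrix is stored as N rows (one embedding per concept) rather than as an M x N array, and the toy embeddings are made up.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def knowledge_concept_weight(m_t, concept_ids, M_k):
    """Return a length-N weight vector that is non-zero only at the exercise's concepts.

    m_t: exercise embedding; concept_ids: memory locations of the exercise's
    concept tags; M_k: concept embeddings, one row per memory location.
    """
    corr = [sum(a * b for a, b in zip(m_t, M_k[n])) for n in concept_ids]
    r_s = softmax(corr)
    weight = [0.0] * len(M_k)
    for n, w in zip(concept_ids, r_s):
        weight[n] = w
    return weight

# Toy example (hypothetical values): N = 4 concepts, 2-dimensional embeddings.
M_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
weight = knowledge_concept_weight([1.0, 0.0], [0, 1, 3], M_k)
```

In this toy case concept 2 is not tagged on the exercise, so its weight stays zero, while the three tagged concepts share a total weight of 1.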


4.3 Read Process 

We then use the obtained KCW to calculate the weighted sum of the
student’s current knowledge concept states to predict the student’s
performance on the exercise. Denoting the KCW by $w$, we have
$r_t = \sum_{i=1}^{N} w_i M_t^v(i)$, i.e., the knowledge state of the
concepts related to the exercise $q_t$.


We further concatenate $r_t$ with the embeddings of the exercise’s
difficulty and stage features, i.e., $d_t$ and $g_t$, and with the
exercise embedding $m_t$. The result then passes through a fully
connected layer with activation function Tanh to get a summary vector
$f_t$, which contains all information of the student’s knowledge state
related to $q_t$ and the exercise’s features, i.e.,

$$f_t = \mathrm{Tanh}(W_0^T [r_t, d_t, g_t, m_t]),$$

where $\mathrm{Tanh}(z_i) = (e^{z_i} - e^{-z_i}) / (e^{z_i} + e^{-z_i})$.


Finally, $f_t$ passes through a fully connected layer to output the
probability that the student would do the exercise $q_t$ correctly.
Denoting the probability by $p_t$, we have

$$p_t = \mathrm{Sigmoid}(W_1^T f_t),$$

where $\mathrm{Sigmoid}(z_i) = 1 / (1 + e^{-z_i})$.
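The read process can be traced end to end with a small sketch. It assumes the value memory is stored as N rows, that `W0` holds the rows of the transposed first-layer weights, and that no bias terms are used (none appear in the formulas above); all numeric values are made up.

```python
import math

def read_and_predict(w, M_v, d_t, g_t, m_t, W0, W1):
    """Weighted read of the value memory, Tanh summary layer, Sigmoid output."""
    dim = len(M_v[0])
    # r_t: KCW-weighted sum of the per-concept state vectors.
    r_t = [sum(w[i] * M_v[i][j] for i in range(len(M_v))) for j in range(dim)]
    x = r_t + d_t + g_t + m_t  # concatenation [r_t, d_t, g_t, m_t]
    f_t = [math.tanh(sum(a * b for a, b in zip(row, x))) for row in W0]
    z = sum(a * b for a, b in zip(W1, f_t))
    p_t = 1.0 / (1.0 + math.exp(-z))
    return r_t, p_t

# Toy values (hypothetical): 3 concepts, 2-dimensional states, 1-dimensional features.
w = [0.5, 0.5, 0.0]
M_v = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
W0 = [[0.1, -0.2, 0.3, 0.0, 0.5], [0.0, 0.4, -0.1, 0.2, 0.1]]
W1 = [1.0, -1.0]
r_t, p_t = read_and_predict(w, M_v, [0.1], [0.2], [0.3], W0, W1)
```

Because the third concept has zero weight, its state does not contribute to `r_t`, however large it is.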


4.4 Update Process 

We then use the KCW to update the student’s knowledge state after
observing her answer result. The update process updates the value matrix
$M_t^v$, whose location $k$ represents the student’s current state of
knowledge concept $k$. Our model is different from the DKVMN model in
that we consider the student’s exercising duration in the update
process. For comparison, DKVMN ignores this student behavior feature.
Specifically, the work in [8] proposed that a student’s duration of
solving a problem is related to her mastery level of latent
problem-solving skills. Inspired by this work, in our model, after a
student finishes an exercise, her answer result (i.e., correct or wrong)
$a_t$ and exercising duration are used to update $M_t^v$. Because the
exercising duration is a continuous variable, it is first discretized
according to its distribution and then represented by its embedding $t$.
We then concatenate $t$ with the joint embedding $s_t$ of the answer
vector $(q_t, a_t)$ to update $M_t^v$, as shown in Fig. 1.
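The text does not specify how the duration is discretized beyond "according to its distribution". One plausible reading is equal-frequency (quantile) binning, sketched below; the binning scheme, the number of bins, and the sample durations are all assumptions made for illustration.

```python
import bisect

def quantile_edges(durations, num_bins):
    """Bin edges that put roughly equal numbers of samples in each bin."""
    ordered = sorted(durations)
    return [ordered[(len(ordered) * k) // num_bins] for k in range(1, num_bins)]

def duration_bin(duration, edges):
    """Index of the bin the duration falls into, from 0 to len(edges)."""
    return bisect.bisect_right(edges, duration)

durations = [3, 5, 8, 12, 20, 35, 60, 120]  # seconds, hypothetical
edges = quantile_edges(durations, 4)
bins = [duration_bin(d, edges) for d in durations]
```

The resulting bin index would then be looked up in an embedding table to obtain the duration embedding.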


The rest of the update process is the same as that of DKVMN. It includes
an erase subprocess and an add subprocess. The erase vector is computed
as $e = \mathrm{Sigmoid}(E^T [s_t, t])$, where $E$ is the erase weights.
The add vector is computed as $a = \mathrm{Tanh}(D^T [s_t, t])$, where
$D$ is the add weights. Then the new memory matrix $M_{t+1}^v$ is
computed by

$$M_{t+1}^v(i) = M_t^v(i)\,[1 - w(i)\,e]\,[1 + w(i)\,a]$$
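The memory update can be sketched as an elementwise operation on each memory location. The sketch follows the update rule as written in this section, with the value memory stored as N rows; the erase and add vectors are given directly as toy values instead of being computed from learned weights.

```python
def update_memory(M_v, w, e, a):
    """Apply the erase and add factors to every memory location, elementwise."""
    return [
        [m * (1.0 - w[i] * e[j]) * (1.0 + w[i] * a[j]) for j, m in enumerate(row)]
        for i, row in enumerate(M_v)
    ]

M_v = [[1.0, 2.0], [3.0, 4.0]]
w = [1.0, 0.0]   # only concept 0 is tagged on the exercise
e = [0.5, 0.0]   # erase vector (Sigmoid output, hypothetical values)
a = [0.0, 0.5]   # add vector (Tanh output, hypothetical values)
M_next = update_memory(M_v, w, e, a)
```

A location whose KCW is zero is left unchanged, which is how the update stays confined to the exercise's own concepts.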


The parameters of the model are learned by minimizing a standard cross
entropy loss between the predicted answer result $p_t$ and the true
result $y_t$:

$$L = -\sum_t \left( y_t \log p_t + (1 - y_t) \log(1 - p_t) \right)$$
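As a quick numeric check of the loss, the sketch below evaluates it on three made-up predictions. The clipping constant `eps` is an addition for numerical safety and is not part of the formula.

```python
import math

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Summed binary cross entropy between true results and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

loss = cross_entropy([1, 0, 1], [0.9, 0.2, 0.6])
```

The three terms are -log 0.9, -log 0.8 and -log 0.6, so the least confident correct prediction (0.6) dominates the loss.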


In summary, compared with DKVMN, we design the model 
structure based on the course’s concept list, and then explic- 
itly consider the exercise-concept mapping relationship and 
other exercise’s features during students’ knowledge tracing. 


5. REINFORCEMENT LEARNING BASED EXERCISES RECOMMENDATION


Based on the DKVMN-CA student knowledge tracing model, we build a
student simulator which provides the environment for reinforcement
learning, and train a personalized exercise recommendation agent with a
deep reinforcement learning method.


Similar to [10], we model the recommendation process as a Partially
Observable Markov Decision Process (POMDP), where the model state is the
student’s latent knowledge state and the action is the recommendation of
an exercise. At time $t$, the reinforcement learning agent cannot
observe the student’s latent knowledge state $s_t$. Instead, it can
observe the student’s exercise and answer result (i.e., correct or
wrong) $o_t$, which is conditioned on the latent knowledge state through
$p(o_t|s_t)$. Thus, at time $t$, the agent needs to recommend an
exercise $a_t$ based on the student’s exercising history before $t$,
which is denoted by $h_t$. We have
$h_t = (o_1, a_1, o_2, a_2, \ldots, o_{t-1}, a_{t-1})$. After the
student finishes the recommended exercise $a_t$, her latent knowledge
state will turn to $s_{t+1}$ by a transition function
$p(s_{t+1}|s_t, a_t)$.


We define the reward $r_t$ of an action $a_t$ as

$$r_t = \frac{1}{K} \sum_{i=1}^{K} P_{t+1}(q_i), \tag{1}$$

where $K$ is the number of exercises, and $P_{t+1}(q_i)$ denotes the
probability of the student getting exercise $q_i$ correct after
finishing the recommended exercise, i.e., at state $s_{t+1}$. It is
predicted by the student simulator, so we name it the student’s
Predicted Knowledge.
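Eq. (1) is a plain average over all exercises, as the following sketch shows. The four probabilities stand in for the simulator's predictions and are made up.

```python
def predicted_knowledge(probabilities):
    """Mean predicted probability of answering each exercise correctly (Eq. 1)."""
    return sum(probabilities) / len(probabilities)

# Hypothetical simulator outputs for K = 4 exercises after the recommended step.
reward = predicted_knowledge([0.9, 0.5, 0.7, 0.3])
```

Here the reward is 0.6: an exercise is rewarded for raising the student's predicted success across the whole exercise pool, not only on the exercise just practiced.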


The purpose of optimization is to maximize the expected discounted
reward $R$ of policy $\pi$:

$$R = E_{\tau}\left[\sum_{t=1}^{\infty} \gamma^{t-1}\, r(s_t, a_t)\right],$$

where $\gamma$ is the discount factor and trajectories
$\tau = (s_1, o_1, a_1, s_2, o_2, a_2, \ldots)$ are drawn from the
trajectory distribution induced by policy $\pi$:
$p(s_1)\,p(o_1|s_1)\,\pi(a_1|h_1)\,p(s_2|s_1, a_1)\,p(o_2|s_2)\,\pi(a_2|h_2) \cdots$.
Thus, as for the action-value function $Q^{\pi}$, the reward of the
recommended exercise sequence at $t$ is:

$$Q^{\pi}(h_t, a_t) = E_{s_t|h_t}\left[r_t(s_t, a_t)\right] + E_{\tau_{>t}|h_t, a_t}\left[\sum_{i=1}^{\infty} \gamma^{i}\, r(s_{t+i}, a_{t+i})\right],$$

where $\tau_{>t} = (s_{t+1}, o_{t+1}, a_{t+1}, \ldots)$ is the future
trajectory. The algorithm then recommends the exercise which has the
maximal reward, i.e., $a_t = \arg\max_a Q^{\pi}(h_t, a)$. Similar to
[10], we approximately solve the POMDP using the Trust Region Policy
Optimization (TRPO) algorithm [11], with an off-the-shelf implementation
from rllab [4].
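The difference between a long-term and a short-term objective can be seen by evaluating the discounted sum on a short reward sequence. The sketch truncates the infinite sum to a finite horizon, and the rewards and discount values are made up.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**(t-1) * r_t over a finite reward sequence r_1..r_T."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

rewards = [0.5, 0.6, 0.7]          # hypothetical predicted-knowledge rewards
long_term = discounted_return(rewards, gamma=0.9)
short_term = discounted_return(rewards, gamma=0.0)
```

With a discount of 0.9 the later rewards contribute (0.5 + 0.54 + 0.567), whereas a discount of 0 reduces the objective to the immediate reward only, which is what a one-step heuristic optimizes.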


6. PERFORMANCE EVALUATION 


In this section, we present the performance evaluation re- 
sults of our system. 


6.1 DKVMN-CA Knowledge Tracing Model 


We evaluated our model on our IPS dataset. To evaluate 
it, we conducted 50 experiments. In each experiment, we 
randomly split the users into two groups: training users and 
testing users. Their percentages are 70% and 30%. We then 
trained the model with the training users and evaluated the 
model on the testing users. Similar to [9], we use area under 
the curve (AUC) as the performance metric. We report the 
maximal, mean, and the standard deviation of the testing 
users’ AUCs of all 50 experiments. 
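One run of this protocol can be sketched as a user-level split followed by an AUC computation. The sketch is illustrative: the seeding scheme and helper names are assumptions, and the pairwise AUC is quadratic in the number of samples, so it is only suitable for small examples.

```python
import random

def split_users(user_ids, train_fraction=0.7, seed=0):
    """Randomly split users (not individual records) into training and testing groups."""
    ids = sorted(user_ids)
    random.Random(seed).shuffle(ids)
    cut = int(round(train_fraction * len(ids)))
    return ids[:cut], ids[cut:]

def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

train_users, test_users = split_users(range(10))
score = auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Repeating the split with 50 different seeds and collecting the test AUC of each run would give the maximum, mean, and spread reported in this section.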


6.1.1 Effectiveness of Concept-Aware Design

We first report the effectiveness of designing the model’s architecture
based on the course’s knowledge concepts. We compare the performance of
DKVMN and DKVMN-CA without the help of other input features, including
exercise difficulty, stage, and duration. The results are shown in
Table 1. See the rows "DKVMN" and "DKVMN-CA". As shown in Table 1, our
model obtains an AUC of 0.724, which is higher than that of the DKVMN
model, i.e., AUC = 0.712. Such an improvement is considerable,
considering the small improvement DKVMN provides over the DKT baseline
(AUC = 0.711). Such a result means that designing the model’s
architecture based on the course’s knowledge concepts is effective.


To highlight the necessity of our design, we also evaluate another
method which also uses exercises’ knowledge concept tags, i.e., it
represents a knowledge concept by its embedding and then concatenates it
with the embedding of the exercise. Its performance is shown in Table 1.
See the row "DKVMN-KC". As shown in Table 1, its mean AUC is 0.714,
meaning a very small improvement over DKVMN (AUC = 0.712). Thus, it is
necessary to design the model’s architecture based on the course’s
knowledge concepts to fully utilize their capacity to improve the model.

Table 1: AUC of Models with Different Features

Model                                   | Mean AUC | Max AUC | Variance
DKT                                     | 0.711    |         |
DKVMN                                   | 0.712    |         |
DKVMN-KC                                | 0.714    |         |
DKVMN-CA                                | 0.724    | 0.730   | 1.48e-05
DKVMN-CA + [illegible]                  | 0.725    | 0.737   | 1.75e-05
DKVMN-CA + [illegible]                  | 0.720    | 0.730   | 44e-0
DKVMN-CA + Difficulty, Stage, Duration  | 0.726    | 0.739   | 2.43e-05


6.1.2 Effectiveness of Other Exercise Features

We then evaluate the effectiveness of adding other features, including
exercise difficulty, stage, and duration. The results are shown in
Table 1. See the rows "DKVMN-CA + Difficulty", "DKVMN-CA + Stage", and
"DKVMN-CA + Duration". As shown in Table 1, these features can further
improve the model’s performance. For example, the mean AUC of
"DKVMN-CA + Stage" is 0.728, which is higher than that of the DKVMN
model (AUC = 0.712).


6.2 Exercises Recommendation


6.2.1 Evaluation of Students’ Knowledge Growth Process

We use the Expectimax algorithm proposed in [9] as the baseline
algorithm. In the Expectimax algorithm, the system first calculates a
student’s predicted knowledge assuming an exercise is recommended to the
student to practice. It then chooses the exercise with the highest
predicted knowledge to recommend.
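A one-step version of this baseline can be sketched as follows. The sketch assumes the lookahead averages over the two possible answer results, weighted by the predicted probability of each; the simulator interface (`predict`, `step`, `knowledge`) and the toy dynamics are hypothetical stand-ins for the DKVMN-CA student simulator.

```python
def expectimax_choice(state, exercises, predict, step, knowledge):
    """Pick the exercise with the highest expected predicted knowledge one step ahead."""
    def expected_knowledge(q):
        p = predict(state, q)  # probability of a correct answer
        return (p * knowledge(step(state, q, True))
                + (1.0 - p) * knowledge(step(state, q, False)))
    return max(exercises, key=expected_knowledge)

# Toy simulator (hypothetical): the state is a per-exercise mastery list.
def predict(state, q):
    return state[q]

def step(state, q, correct):
    new_state = list(state)
    new_state[q] = min(1.0, new_state[q] + (0.2 if correct else 0.05))
    return new_state

def knowledge(state):
    return sum(state) / len(state)

best = expectimax_choice([0.2, 0.9, 0.5], [0, 1, 2], predict, step, knowledge)
```

In this toy setting the lookahead prefers exercise 2, whose expected gain is larger than that of the nearly mastered exercise 1 and the rarely solved exercise 0. Because it looks only one step ahead, such a policy optimizes the immediate reward.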


To compare the two algorithms, we first randomly pick 15 students in our
dataset. For each student, we conduct two experiments, one for each
algorithm. In each experiment, we first initialize the student simulator
using the student’s historical practice sequence. Then, we successively
recommend 50 exercises to the student simulator with the recommendation
algorithm. During the process, we record the average of the 15 students’
predicted knowledge, as defined in Eq. (1), at each step of
recommendation. The results are shown in Fig. 2. As shown in Fig. 2, the
students served by the RL policy have a higher mean predicted knowledge
than the students served by the Expectimax policy after 50 exercises.
Moreover, after about 10 exercises, the mean predicted knowledge of the
students served by the Expectimax policy stops increasing, meaning that
the policy cannot find exercises which help the students improve their
performance any more. In contrast, when served by the RL policy, which
considers the long-term reward of actions, the students’ mean predicted
knowledge keeps increasing, meaning that the RL policy can still find
exercises which help the students improve their performance.

Figure 2: Performance evaluation results

Figure 3: Knowledge State Variation


6.2.2 Evaluation of Recommendation Process 

We design another experiment to observe the recommendation behavior of
the RL policy. We randomly pick a student who has practiced five
exercises, and use her exercise sequence to initialize the student
simulator. Then, we serve her with five more exercises using the RL
recommendation policy. Fig. 3 shows the results. The x-axis labels of
Fig. 3 show the 10 exercises’ IDs, concepts, and results. For example,
the first record (88, 5, 0) means that the exercise ID is 88, that it is
related to concept 5, and that the student’s answer is wrong. As the 10
exercises are related to 6 concepts, we plot the student’s predicted
knowledge of each concept in Fig. 3. For instance, as the student fails
the first exercise 88, which is related to the 5th concept, the
student’s knowledge status on the 5th concept is relatively low. The
status of the knowledge concepts not covered by the student’s history
exercises is indicated in black. We now examine the recommended
exercises and make the following observations:


• As shown in Fig. 3, the first five exercises are related to concepts
  2, 4, 5, and the later five exercises are related to concepts 1, 3, 6,
  suggesting that the RL algorithm wants to explore the student’s
  capacity in other concepts.


• After the student succeeds in exercise 760, which is related to
  concept 3 (Decomposition of Prime Factor), the algorithm recommends
  exercise 642, which is related to concept 6 (Maximum common factor and
  Least common multiple). As concept 6 is related to concept 3, such a
  recommendation is reasonable.


• The student, however, fails to finish exercise 642. Thus, the
  algorithm recommends exercise 642 again. This time, the student
  succeeds in finishing it, meaning that the model captures the
  phenomenon during training that a student who failed exercise 642 may
  succeed if she retries. Such a result is interesting.


• After the student succeeds in exercise 642, which is related to
  concept 6 (Maximum common factor and Least common multiple), the
  model’s estimation of the student’s capacity on concept 3
  (Decomposition of Prime Factor) also slightly increases. As these two
  concepts are indeed related, such a result is reasonable.


• Then, the algorithm turns to another concept again, i.e., it
  recommends exercise 1278, which is related to concept 1. Although the
  student succeeds in the exercise, the estimated knowledge status on
  concept 1 remains relatively low, suggesting that the exercise is
  relatively easy.


• At last, exercise 760 is recommended again, and the student succeeds
  in it. As a result, the model’s estimation of the student’s capacity
  on concept 3 increases, suggesting that reviewing is beneficial for
  study.


7. CONCLUSION 


In this paper, we improve DKVMN by designing its neural network
structure based on a course’s concept list, and explicitly considering
the exercise-concept mapping relationship during students’ knowledge
tracing. We also enhance the DKVMN model to consider more input
features. Experimental results show that our model has higher
performance than existing deep knowledge tracing models. We also propose
an exercise recommendation algorithm which uses model-free reinforcement
learning with neural network function approximation to learn an exercise
recommendation policy that directly operates on raw observations of a
student’s exercise history. Our experimental results demonstrate that
our policy achieves better performance than an existing heuristic policy
in terms of maximizing the students’ knowledge level. To the best of our
knowledge, this is the first time that deep reinforcement learning has
been applied to personalized mathematical exercise recommendation.